• Clustering analysis with the DBSCAN algorithm, run three times (the CSVs with the labels from the first and third runs are saved), and with the OPTICS algorithm.

  • Geohashing algorithm.
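
The geohashing step is only named in this summary; as context, here is a minimal sketch of standard geohash encoding (interleave longitude/latitude bisection bits, emit base32 digits). This is an illustrative implementation, not necessarily the one used in the notebook:

```python
def geohash_encode(lat, lon, precision=12):
    """Encode (lat, lon) into a standard base32 geohash string."""
    base32 = "0123456789bcdefghjkmnpqrstuvwxyz"
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    is_lon = True          # bits alternate, starting with longitude
    bit_count, ch = 0, 0
    out = []
    while len(out) < precision:
        if is_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                ch, lon_lo = (ch << 1) | 1, mid
            else:
                ch, lon_hi = ch << 1, mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                ch, lat_lo = (ch << 1) | 1, mid
            else:
                ch, lat_hi = ch << 1, mid
        is_lon = not is_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base32 character
            out.append(base32[ch])
            bit_count, ch = 0, 0
    return "".join(out)

# canonical reference point from the geohash literature
print(geohash_encode(57.64911, 10.40744, 11))  # → u4pruydqqvj
# a point in the Pisa area, as used in this analysis
print(geohash_encode(43.7167, 10.3833, 7))
```

Shorter prefixes give coarser cells, so truncating the hash is a cheap way to bucket nearby coordinates together.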

  • Import the data

    In [352]:
    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import DBSCAN
    from sklearn.cluster import OPTICS, cluster_optics_dbscan
    from sklearn import metrics
    from sklearn.datasets import make_blobs
    import matplotlib.gridspec as gridspec
    import matplotlib.pyplot as plt
    from sklearn import preprocessing
    import seaborn as sns
    import statistics
    
    In [272]:
    df = pd.read_csv('../df.csv')
    df
    
    Out[272]:
    Unnamed: 0 Screen_name UserID TweetID Coords Lat Lon Created_At Text
    0 x madikeeper12 868809325 779072240994234368 [43.72666207, 10.41268069] 43.726662 10.412681 Thu Sep 22 21:37:51 +0000 2016 Cieli infuocati.\n\n#picoftheday #quotesofthed...
    1 x madikeeper12 868809325 781615843406819329 [43.72666207, 10.41268069] 43.726662 10.412681 Thu Sep 29 22:05:13 +0000 2016 Prospettive.. \nunite a casa #ilselfone\n#team...
    2 x madikeeper12 868809325 781870800156499968 [43.72666207, 10.41268069] 43.726662 10.412681 Fri Sep 30 14:58:19 +0000 2016 Non occorre essere matti per lavorare qui, ma ...
    3 x madikeeper12 868809325 780003801260404736 [43.7167, 10.3833] 43.716700 10.383300 Sun Sep 25 11:19:32 +0000 2016 RunOnSunDay 🏃🏽‍♀️☀️\n#run #running #runner #ni...
    4 x madikeeper12 868809325 779443101123260417 [43.70561, 10.42059] 43.705610 10.420590 Fri Sep 23 22:11:31 +0000 2016 La vita è come la fotografia sono necessari i ...
    ... ... ... ... ... ... ... ... ... ...
    632 x antoniocassisa 358042635 781879291911016448 [43.7167, 10.3833] 43.716700 10.383300 Fri Sep 30 15:32:04 +0000 2016 I mì ómini \n#son #figli #boys @ Pisa, Italy h...
    633 x SefaMermer 293157588 780753755830677504 [43.7167, 10.3833] 43.716700 10.383300 Tue Sep 27 12:59:35 +0000 2016 #love #tbt #tagforlikes #TFLers #tweegram #pho...
    634 x SefaMermer 293157588 780756143668953088 [43.7167, 10.3833] 43.716700 10.383300 Tue Sep 27 13:09:05 +0000 2016 #love #tbt #tagforlikes #TFLers #tweegram #pho...
    635 x matteluca89 494389053 779638196258811904 [43.71544235, 10.40051616] 43.715442 10.400516 Sat Sep 24 11:06:45 +0000 2016 Last saturday I went out with my #chinese teac...
    636 x anabrmotta 98254561 781123690343698432 [43.72263, 10.3948] 43.722630 10.394800 Wed Sep 28 13:29:35 +0000 2016 Já que é pra tombar, ela tombou (só um pouquin...

    637 rows × 9 columns

    In [273]:
    df.drop(['Unnamed: 0'], axis='columns', inplace=True)
    df
    
    Out[273]:
    Screen_name UserID TweetID Coords Lat Lon Created_At Text
    0 madikeeper12 868809325 779072240994234368 [43.72666207, 10.41268069] 43.726662 10.412681 Thu Sep 22 21:37:51 +0000 2016 Cieli infuocati.\n\n#picoftheday #quotesofthed...
    1 madikeeper12 868809325 781615843406819329 [43.72666207, 10.41268069] 43.726662 10.412681 Thu Sep 29 22:05:13 +0000 2016 Prospettive.. \nunite a casa #ilselfone\n#team...
    2 madikeeper12 868809325 781870800156499968 [43.72666207, 10.41268069] 43.726662 10.412681 Fri Sep 30 14:58:19 +0000 2016 Non occorre essere matti per lavorare qui, ma ...
    3 madikeeper12 868809325 780003801260404736 [43.7167, 10.3833] 43.716700 10.383300 Sun Sep 25 11:19:32 +0000 2016 RunOnSunDay 🏃🏽‍♀️☀️\n#run #running #runner #ni...
    4 madikeeper12 868809325 779443101123260417 [43.70561, 10.42059] 43.705610 10.420590 Fri Sep 23 22:11:31 +0000 2016 La vita è come la fotografia sono necessari i ...
    ... ... ... ... ... ... ... ... ...
    632 antoniocassisa 358042635 781879291911016448 [43.7167, 10.3833] 43.716700 10.383300 Fri Sep 30 15:32:04 +0000 2016 I mì ómini \n#son #figli #boys @ Pisa, Italy h...
    633 SefaMermer 293157588 780753755830677504 [43.7167, 10.3833] 43.716700 10.383300 Tue Sep 27 12:59:35 +0000 2016 #love #tbt #tagforlikes #TFLers #tweegram #pho...
    634 SefaMermer 293157588 780756143668953088 [43.7167, 10.3833] 43.716700 10.383300 Tue Sep 27 13:09:05 +0000 2016 #love #tbt #tagforlikes #TFLers #tweegram #pho...
    635 matteluca89 494389053 779638196258811904 [43.71544235, 10.40051616] 43.715442 10.400516 Sat Sep 24 11:06:45 +0000 2016 Last saturday I went out with my #chinese teac...
    636 anabrmotta 98254561 781123690343698432 [43.72263, 10.3948] 43.722630 10.394800 Wed Sep 28 13:29:35 +0000 2016 Já que é pra tombar, ela tombou (só um pouquin...

    637 rows × 8 columns

    In [274]:
    # check the data type of the coordinates:
    type(df['Coords'][1])
    
    Out[274]:
    str
    In [275]:
    df['Coords'][1]
    
    Out[275]:
    '[43.72666207, 10.41268069]'
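
Since Coords is a string, the list literal can be parsed back into floats with ast.literal_eval; a standalone sketch on hypothetical values mirroring the column:

```python
import ast
import pandas as pd

# hypothetical miniature of the Coords column: list literals stored as strings
coords = pd.Series(['[43.72666207, 10.41268069]', '[43.7167, 10.3833]'])

# literal_eval safely parses the Python list literal inside each string
parsed = coords.apply(ast.literal_eval)
lat = parsed.str[0]
lon = parsed.str[1]
print(lat.tolist())  # → [43.72666207, 43.7167]
```

Here this is not strictly needed, since the dataframe already carries separate Lat and Lon columns.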

    I create a series containing only the coordinates (lat and lon together):

    In [276]:
    df_coords = df['Coords']
    df_coords
    
    Out[276]:
    0      [43.72666207, 10.41268069]
    1      [43.72666207, 10.41268069]
    2      [43.72666207, 10.41268069]
    3              [43.7167, 10.3833]
    4            [43.70561, 10.42059]
                      ...            
    632            [43.7167, 10.3833]
    633            [43.7167, 10.3833]
    634            [43.7167, 10.3833]
    635    [43.71544235, 10.40051616]
    636           [43.72263, 10.3948]
    Name: Coords, Length: 637, dtype: object

    Normalize the data

    I create a dataframe X containing only the Lat and Lon columns, then standardize the data for clustering.

    In [277]:
    X = df[['Lat', 'Lon']]
    X
    
    Out[277]:
    Lat Lon
    0 43.726662 10.412681
    1 43.726662 10.412681
    2 43.726662 10.412681
    3 43.716700 10.383300
    4 43.705610 10.420590
    ... ... ...
    632 43.716700 10.383300
    633 43.716700 10.383300
    634 43.716700 10.383300
    635 43.715442 10.400516
    636 43.722630 10.394800

    637 rows × 2 columns

    In [278]:
    X_norm = StandardScaler().fit_transform(X)
    
    X_norm
    
    Out[278]:
    array([[ 1.21862633,  2.50548165],
           [ 1.21862633,  2.50548165],
           [ 1.21862633,  2.50548165],
           ...,
           [-0.46658272, -1.48740575],
           [-0.67932998,  0.85230065],
           [ 0.53655113,  0.07546446]])
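
StandardScaler implements the z-score transform (x − mean) / std, column by column; a quick self-contained check of that equivalence:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_scaled = StandardScaler().fit_transform(X_demo)

# equivalent manual z-score with the population std (ddof=0), as sklearn uses
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
print(np.allclose(X_scaled, manual))  # → True
```

After the transform each column has mean 0 and unit variance, so latitude and longitude contribute on the same scale to the distance computations.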

    Clustering with OPTICS

    OPTICS is an extension of DBSCAN that does not require fixing the eps radius in advance: it orders points by reachability distance, so clusters can be extracted at multiple density levels.
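
As a self-contained illustration of this (synthetic blobs and hypothetical parameters, not the tweet data): OPTICS is fitted once, and cluster_optics_dbscan then extracts DBSCAN-style labellings at several eps values from the same reachability ordering:

```python
import numpy as np
from sklearn.cluster import OPTICS, cluster_optics_dbscan
from sklearn.datasets import make_blobs

# synthetic 2-D data standing in for the (scaled) coordinates
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.4, random_state=0)
opt = OPTICS(min_samples=10).fit(X_demo)

# one fit, several eps cuts: no single radius has to be chosen up front
for eps in (0.5, 1.0):
    labels = cluster_optics_dbscan(reachability=opt.reachability_,
                                   core_distances=opt.core_distances_,
                                   ordering=opt.ordering_, eps=eps)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print(eps, n_clusters)
```

With DBSCAN the same comparison would require one full re-fit per eps value.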

    In [382]:
    # optics 1 clustering
    print ('optics 1, min_samples=10, default metric=minkowski')
    
    optics1 = OPTICS(min_samples=10).fit(X_norm) #default metric = minkowski
    labels_opt1 = optics1.labels_
    
    hist, bins = np.histogram(labels_opt1, bins=range(-1, len(set(labels_opt1)) + 1))
    
    print ('labels', dict(zip(bins, hist)))
    print ('silhouette', metrics.silhouette_score(X_norm, labels_opt1))
    print ('mean cluster dimension', statistics.mean(hist))
    print ('median cluster dimension', statistics.median(hist))
    print ('% outliers', dict(zip(bins, hist))[-1]*100/sum(hist))
    
    optics 1, min_samples=10, default metric=minkowski
    labels {-1: 122, 0: 161, 1: 63, 2: 13, 3: 10, 4: 14, 5: 16, 6: 17, 7: 16, 8: 16, 9: 10, 10: 16, 11: 132, 12: 11, 13: 20, 14: 0}
    silhouette 0.6297302946880552
    mean cluster dimension 39
    median cluster dimension 16.0
    % outliers 19.152276295133436
    
    /Users/ariannalisi/opt/anaconda3/lib/python3.8/site-packages/sklearn/cluster/_optics.py:804: RuntimeWarning: divide by zero encountered in true_divide
      ratio = reachability_plot[:-1] / reachability_plot[1:]
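
The np.histogram bookkeeping used above for the cluster sizes can be written more directly with np.unique; a sketch on a hypothetical label array:

```python
import numpy as np

labels = np.array([-1, -1, 0, 0, 0, 1, 1])  # hypothetical cluster labels
values, counts = np.unique(labels, return_counts=True)
sizes = dict(zip(values.tolist(), counts.tolist()))
print(sizes)                           # → {-1: 2, 0: 3, 1: 2}
print(100 * sizes[-1] / labels.size)   # share of noise points, in percent
```

This also avoids the trailing empty bin that bins=range(-1, len(set(labels)) + 1) produces.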
    
    In [280]:
    X_norm
    
    Out[280]:
    array([[ 1.21862633,  2.50548165],
           [ 1.21862633,  2.50548165],
           [ 1.21862633,  2.50548165],
           ...,
           [-0.46658272, -1.48740575],
           [-0.67932998,  0.85230065],
           [ 0.53655113,  0.07546446]])
    In [281]:
    # PLOT OPTICS 1
    colors = ['royalblue', 'maroon', 'forestgreen', 'mediumorchid', 'tan', 'deeppink', 'olive', 'goldenrod', 'lightcyan', 'navy', 'yellow', 'purple', 'black', 'grey', 'red', 'brown']
    vectorizer = np.vectorize(lambda x: colors[x % len(colors)])
    
    plt.scatter(X_norm[:,0], X_norm[:,1], c=vectorizer(labels_opt1), marker="o", picker=True)
    plt.title('Clustering\nOPTICS 1')
    plt.show()
    
    In [383]:
    # optics 2 clustering
    print ('optics 2, min_samples=50, xi=0.05, min_cluster_size=0.05')
    
    optics2 = OPTICS(min_samples=50, xi=0.05, min_cluster_size=0.05).fit(X_norm) #default metric = minkowski
    labels_opt2 = optics2.labels_
    
    hist, bins = np.histogram(labels_opt2, bins=range(-1, len(set(labels_opt2)) + 1))
    
    print ('labels', dict(zip(bins, hist)))
    print ('silhouette', metrics.silhouette_score(X_norm, labels_opt2))
    print ('mean cluster dimension', statistics.mean(hist))
    print ('median cluster dimension', statistics.median(hist))
    print ('% outliers', dict(zip(bins, hist))[-1]*100/sum(hist))
    
    optics 2, min_samples=50, xi=0.05, min_cluster_size=0.05
    labels {-1: 167, 0: 64, 1: 63, 2: 161, 3: 50, 4: 132, 5: 0}
    silhouette 0.5288974487338574
    mean cluster dimension 91
    median cluster dimension 64
    % outliers 26.21664050235479
    
    /Users/ariannalisi/opt/anaconda3/lib/python3.8/site-packages/sklearn/cluster/_optics.py:804: RuntimeWarning: divide by zero encountered in true_divide
      ratio = reachability_plot[:-1] / reachability_plot[1:]
    
    In [283]:
    # PLOT OPTICS 2
    colors = ['royalblue', 'maroon', 'forestgreen', 'mediumorchid', 'tan', 'deeppink', 'olive']
    vectorizer = np.vectorize(lambda x: colors[x % len(colors)])
    
    plt.scatter(X_norm[:,0], X_norm[:,1], c=vectorizer(labels_opt2), marker="o", picker=True)
    plt.title('Clustering\nOPTICS 2')
    plt.show()
    
    In [384]:
    # optics 3 clustering
    print ('optics 3, min_samples=10, metric=cityblock')
    
    optics3 = OPTICS(min_samples=10, metric='cityblock').fit(X_norm)
    labels_opt3 = optics3.labels_
    
    hist, bins = np.histogram(labels_opt3, bins=range(-1, len(set(labels_opt3)) + 1))
    
    print ('labels', dict(zip(bins, hist)))
    print ('silhouette', metrics.silhouette_score(X_norm, labels_opt3))
    print ('mean cluster dimension', statistics.mean(hist))
    print ('median cluster dimension', statistics.median(hist))
    print ('% outliers', dict(zip(bins, hist))[-1]*100/sum(hist))
    
    optics 3, min_samples=10, metric=cityblock
    labels {-1: 124, 0: 16, 1: 14, 2: 17, 3: 10, 4: 17, 5: 62, 6: 161, 7: 10, 8: 14, 9: 13, 10: 16, 11: 132, 12: 11, 13: 20, 14: 0}
    silhouette 0.6253645521254183
    mean cluster dimension 39
    median cluster dimension 16.0
    % outliers 19.46624803767661
    
    /Users/ariannalisi/opt/anaconda3/lib/python3.8/site-packages/sklearn/cluster/_optics.py:804: RuntimeWarning: divide by zero encountered in true_divide
      ratio = reachability_plot[:-1] / reachability_plot[1:]
    
    In [285]:
    # PLOT OPTICS 3
    colors = ['royalblue', 'maroon', 'forestgreen', 'mediumorchid', 'tan', 'deeppink', 'olive', 'goldenrod', 'lightcyan', 'navy', 'yellow', 'purple', 'black', 'grey', 'red', 'brown']
    vectorizer = np.vectorize(lambda x: colors[x % len(colors)])
    
    plt.scatter(X_norm[:,0], X_norm[:,1], c=vectorizer(labels_opt3), marker="o", picker=True)
    plt.title('Clustering\nOPTICS 3')
    plt.show()
    

    Experiments with the OPTICS plot taken from https://scikit-learn.org/stable/auto_examples/cluster/plot_optics.html — I MAY REMOVE THIS

    Keep studying the code and tweaking it to refine the results! Below:

    In [286]:
    from sklearn.cluster import OPTICS, cluster_optics_dbscan
    import matplotlib.gridspec as gridspec
    import matplotlib.pyplot as plt
    import numpy as np
    
    In [287]:
    labels_050 = cluster_optics_dbscan(
        reachability=optics1.reachability_,
        core_distances=optics1.core_distances_,
        ordering=optics1.ordering_,
        eps=0.5,
    )
    labels_100 = cluster_optics_dbscan(
        reachability=optics1.reachability_,
        core_distances=optics1.core_distances_,
        ordering=optics1.ordering_,
        eps=1,
    )
    
    #np.arange => returns evenly spaced values within a given interval.
    #Values are generated within the half-open interval [start, stop) 
    #(in other words, the interval including start but excluding stop). 
    #For integer arguments the function is equivalent to the Python built-in range function, 
    #but returns an ndarray rather than a list.
    space = np.arange(len(X_norm))
    reachability = optics1.reachability_[optics1.ordering_]
    labels_opt = optics1.labels_[optics1.ordering_]
    
    In [288]:
    plt.figure(figsize=(10, 10))
    G = gridspec.GridSpec(2, 2)
    ax1 = plt.subplot(G[0, :])
    ax2 = plt.subplot(G[1, :])
    
    # Reachability plot
    colors = ["g.", "r.", "b.", "y.", "c."]
    for klass, color in zip(range(0, 5), colors):
        Xk = space[labels_opt == klass]
        Rk = reachability[labels_opt == klass]
        ax1.plot(Xk, Rk, color, alpha=0.3)
    ax1.plot(space[labels_opt == -1], reachability[labels_opt == -1], "k.", alpha=0.3)
    ax1.plot(space, np.full_like(space, 1.0, dtype=float), "k-", alpha=0.5)
    ax1.plot(space, np.full_like(space, 0.5, dtype=float), "k-.", alpha=0.5)
    ax1.set_ylabel("Reachability (epsilon distance)")
    ax1.set_title("Reachability Plot")
    
    # OPTICS
    colors = ["g.", "r.", "b.", "y.", "c."]
    for klass, color in zip(range(0, 5), colors):
        Xk = X_norm[optics1.labels_ == klass]
        ax2.plot(Xk[:, 0], Xk[:, 1], color, alpha=0.3)
    ax2.plot(X_norm[optics1.labels_ == -1, 0], X_norm[optics1.labels_ == -1, 1], "k+", alpha=0.1)
    ax2.set_title("Automatic Clustering\nOPTICS")
    
    Out[288]:
    Text(0.5, 1.0, 'Automatic Clustering\nOPTICS')

    Working on OPTICS 1: I remove the points labelled -1, which are noise points and therefore not of interest.
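
The drop used below is equivalent to a boolean mask on the label column; a standalone sketch with hypothetical points:

```python
import pandas as pd

# hypothetical points with cluster labels; -1 marks noise
df_demo = pd.DataFrame({'Lat': [43.70, 43.71, 43.72, 43.73],
                        'Lon': [10.40, 10.41, 10.39, 10.42],
                        'Labels': [0, -1, 0, 1]})

cleaned = df_demo[df_demo['Labels'] != -1]  # keep only clustered points
print(len(cleaned))  # → 3
```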

    In [289]:
    X_optics1 = X.copy() # create a new dataframe to hold the labels obtained from clustering with optics1
    X_optics1
    
    Out[289]:
    Lat Lon
    0 43.726662 10.412681
    1 43.726662 10.412681
    2 43.726662 10.412681
    3 43.716700 10.383300
    4 43.705610 10.420590
    ... ... ...
    632 43.716700 10.383300
    633 43.716700 10.383300
    634 43.716700 10.383300
    635 43.715442 10.400516
    636 43.722630 10.394800

    637 rows × 2 columns

    In [290]:
    labels_optics1 = labels_opt1 # alias for convenience
    X_optics1.insert(2, "Labels", labels_optics1, True)
    X_optics1.sort_values(by=['Labels'])
    
    Out[290]:
    Lat Lon Labels
    0 43.726662 10.412681 -1
    224 43.722037 10.396777 -1
    414 43.709448 10.405744 -1
    407 43.724230 10.397150 -1
    398 43.723272 10.395619 -1
    ... ... ... ...
    335 43.698970 10.399040 13
    107 43.698970 10.399040 13
    596 43.698970 10.399040 13
    392 43.698420 10.400197 13
    135 43.698970 10.399040 13

    637 rows × 3 columns

    In [291]:
    X_optics1_cleaned = X_optics1.drop(X_optics1[X_optics1['Labels'] == -1].index)
    X_optics1_cleaned.sort_values(by=['Labels'])
    
    Out[291]:
    Lat Lon Labels
    200 43.723056 10.396417 0
    217 43.723056 10.396417 0
    220 43.723056 10.396417 0
    221 43.723056 10.396417 0
    222 43.723056 10.396417 0
    ... ... ... ...
    270 43.698970 10.399040 13
    98 43.698970 10.399040 13
    107 43.698970 10.399040 13
    31 43.698549 10.400072 13
    585 43.695943 10.398581 13

    515 rows × 3 columns

    In [292]:
    # note: X_optics1_cleaned still includes the Labels column, so the scaled
    # array has three features (Lat, Lon, old label) and the re-clustering
    # below is partly driven by the previous labels
    X_optics1_cleaned_norm = StandardScaler().fit_transform(X_optics1_cleaned)
    X_optics1_cleaned_norm
    
    Out[292]:
    array([[-0.49233594, -1.51756062,  1.17056282],
           [ 0.58686185,  0.54472564, -1.05917436],
           [ 0.58686185,  0.54472564, -1.05917436],
           ...,
           [-0.49233594, -1.51756062,  1.17056282],
           [-0.7058896 ,  1.18927353,  0.3597493 ],
           [ 0.51460017,  0.29054251, -0.45106422]])
    In [293]:
    # optics1 cleaned clustering
    print ('optics 1 cleaned (without noise)')
    
    optics1.fit(X_optics1_cleaned_norm)
    labels_optics1_cleaned = optics1.labels_
    
    hist, bins = np.histogram(labels_optics1_cleaned, bins=range(-1, len(set(labels_optics1_cleaned)) + 1))
    
    print ('labels', dict(zip(bins, hist)))
    print ('silhouette', metrics.silhouette_score(X_optics1_cleaned_norm, labels_optics1_cleaned))
    print ('mean cluster dimension', statistics.mean(hist))
    print ('median cluster dimension', statistics.median(hist))
    
    optics 1 cleaned (without noise)
    labels {-1: 11, 0: 132, 1: 16, 2: 15, 3: 17, 4: 16, 5: 14, 6: 10, 7: 13, 8: 63, 9: 161, 10: 16, 11: 11, 12: 20, 13: 0}
    silhouette 0.9478294385790876
    
    /Users/ariannalisi/opt/anaconda3/lib/python3.8/site-packages/sklearn/cluster/_optics.py:804: RuntimeWarning: divide by zero encountered in true_divide
      ratio = reachability_plot[:-1] / reachability_plot[1:]
    
    In [294]:
    # plot optics cleaned 1
    
    colors = ['royalblue', 'maroon', 'forestgreen', 'mediumorchid', 'tan', 'deeppink', 'olive', 'goldenrod', 'lightcyan', 'navy', 'yellow', 'purple', 'black', 'grey', 'red']
    vectorizer = np.vectorize(lambda x: colors[x % len(colors)])
    
    plt.scatter(X_optics1_cleaned_norm[:,0], X_optics1_cleaned_norm[:,1], c=vectorizer(labels_optics1_cleaned))
    plt.title('Clustering\nOPTICS 1 Cleaned')
    plt.show()
    

    I display the clusters on the map with folium

    In [295]:
    import folium
    from folium import plugins
    
    import matplotlib.pyplot as plt
    import matplotlib.cm as cm
    import matplotlib.colors as colors
    
    import seaborn as sns
    
    lat = [43.7359, 43.6955]
    lon = [10.4269, 10.3686]
    
    lat_mean = np.mean(lat)
    lon_mean = np.mean(lon)
    
    lat, lng = (lat_mean, lon_mean)
    
    In [296]:
    map_clusters1 = folium.Map(location=[lat, lng], zoom_start=13.2)
    
    # set color scheme for the clusters
    x = np.arange(15)
    ys = [i + x + (i*x)**2 for i in range(15)]
    colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
    rainbow = [colors.rgb2hex(i) for i in colors_array]
    
    # add markers to the map
    markers_colors1 = []
    for lat, lng, cluster in zip(X_optics1['Lat'], X_optics1['Lon'],  
                                                X_optics1['Labels']):
        #label = folium.Popup(str(city)+ ','+str(state) + '- Cluster ' + str(cluster), parse_html=True)
        folium.vector_layers.CircleMarker(
            [lat, lng],
            radius=5,
            #popup=label,
            tooltip = 'Cluster ' + str(cluster),
            color=rainbow[cluster-1],
            fill=True,
            fill_color=rainbow[cluster-1],
            fill_opacity=0.9).add_to(map_clusters1)
    
    print('Map with OPTICS 1 clustering (with noise)')
    map_clusters1
    
    Map with OPTICS 1 clustering (with noise)
    
    Out[296]:
    Make this Notebook Trusted to load map: File -> Trust Notebook
    In [297]:
    map_clusters2 = folium.Map(location=[lat, lng], zoom_start=13.2)
    
    # set color scheme for the clusters
    x = np.arange(15)
    ys = [i + x + (i*x)**2 for i in range(15)]
    colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
    rainbow = [colors.rgb2hex(i) for i in colors_array]
    
    # add markers to the map
    markers_colors2 = []
    for lat, lng, cluster in zip(X_optics1_cleaned['Lat'], X_optics1_cleaned['Lon'],  
                                                X_optics1_cleaned['Labels']):
        #label = folium.Popup(str(city)+ ','+str(state) + '- Cluster ' + str(cluster), parse_html=True)
        folium.vector_layers.CircleMarker(
            [lat, lng],
            radius=5,
            #popup=label,
            tooltip = 'Cluster ' + str(cluster),
            color=rainbow[cluster-1],
            fill=True,
            fill_color=rainbow[cluster-1],
            fill_opacity=0.9).add_to(map_clusters2)
    
    print('Map with OPTICS 1 clustering (cleaned)')
    map_clusters2
    
    Map with OPTICS 1 clustering (cleaned)
    
    Out[297]:
    Make this Notebook Trusted to load map: File -> Trust Notebook

    DBSCAN

    Knee method with Nearest Neighbors

    In [298]:
    from sklearn.neighbors import NearestNeighbors
    
    neigh = NearestNeighbors(n_neighbors=2)
    nbrs = neigh.fit(X_norm)
    # .kneighbors finds the K-neighbors of a point and returns indices of and distances to the neighbors of each point.
    distances, indices = nbrs.kneighbors(X_norm) 
    
    distances
    
    Out[298]:
    array([[0.        , 0.        ],
           [0.        , 0.        ],
           [0.        , 0.        ],
           ...,
           [0.        , 0.        ],
           [0.        , 0.09374716],
           [0.        , 0.        ]])
    In [299]:
    distances = np.sort(distances, axis=0)
    distances = distances[:,1]
    plt.plot(distances)
    
    Out[299]:
    [<matplotlib.lines.Line2D at 0x7fa6616eadc0>]

    The knee seems to be around eps=0.2.
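
Instead of eyeballing the elbow, the knee can also be estimated programmatically, e.g. as the point of the sorted k-distance curve farthest from the chord joining its endpoints. A sketch on a synthetic curve (knee_index is a hypothetical helper, and the curve is illustrative, not the notebook's distances):

```python
import numpy as np

def knee_index(y):
    """Index of the point farthest from the line joining first and last point."""
    y = np.asarray(y, dtype=float)
    x = np.arange(len(y), dtype=float)
    # unit vector along the chord from (x0, y0) to (xn, yn)
    chord = np.array([x[-1] - x[0], y[-1] - y[0]])
    chord = chord / np.linalg.norm(chord)
    # perpendicular distance of each point from the chord
    vecs = np.stack([x - x[0], y - y[0]], axis=1)
    proj = vecs @ chord
    dists = np.linalg.norm(vecs - np.outer(proj, chord), axis=1)
    return int(np.argmax(dists))

# hypothetical sorted k-distances: flat until index 80, then steeply rising
curve = np.concatenate([np.linspace(0.0, 0.2, 81), np.linspace(0.25, 2.0, 20)])
i = knee_index(curve)
print(i, curve[i])  # → 80 0.2
```

The distance value at the detected index is then a candidate eps for DBSCAN.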

    Knee method with pairwise distances between observations in n-dimensional space

    I use the knee method to find the most suitable parameters.

    In [300]:
    from scipy.spatial.distance import pdist, squareform
    
    In [301]:
    distance = pdist(X_norm, 'euclidean') #pair wise distances between observations in n-dimensional space
    print (distance)
    #squareform converts between condensed distance matrices and square distance matrices
    distance = squareform(distance) #distance matrix given the vector distance 
    print()
    print(distance)
    
    [0.         0.         4.333945   ... 2.34935894 1.85710549 1.44285874]
    
    [[0.         0.         0.         ... 4.333945   2.51699138 2.52392752]
     [0.         0.         0.         ... 4.333945   2.51699138 2.52392752]
     [0.         0.         0.         ... 4.333945   2.51699138 2.52392752]
     ...
     [4.333945   4.333945   4.333945   ... 0.         2.34935894 1.85710549]
     [2.51699138 2.51699138 2.51699138 ... 2.34935894 0.         1.44285874]
     [2.52392752 2.52392752 2.52392752 ... 1.85710549 1.44285874 0.        ]]
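
pdist returns the condensed upper-triangular vector of pairwise distances, and squareform expands it into the full symmetric matrix; a three-point illustration with a 3-4-5 triangle:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

pts = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
cond = pdist(pts, 'euclidean')   # pairs (0,1), (0,2), (1,2)
print(cond)                      # → [3. 4. 5.]
sq = squareform(cond)            # full n x n symmetric matrix
print(sq.shape, sq[1, 2])        # → (3, 3) 5.0
```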
    
    In [302]:
    k = 5
    kth_distances = list()
    for d in distance:
        index_kth_distance = np.argsort(d)[k]
        kth_distances.append(d[index_kth_distance])
    
    In [303]:
    plt.plot(range(0, len(kth_distances)), sorted(kth_distances))
    plt.ylabel('dist from %sth neighbor' % k, fontsize=14)
    plt.xlabel('sorted distances', fontsize=14)
    plt.tick_params(axis='both', which='major', labelsize=14)
    plt.show()
    

    The knee again seems to be around eps=0.2 or 0.3.

    Clustering experiments

    In [385]:
    # experiment 1: density-based clustering
    print ('dbscan 1, eps 0.2')
    
    dbscan1 = DBSCAN(eps=0.2, min_samples=5)
    dbscan1.fit(X_norm)
    labels_dbscan1 = dbscan1.labels_
    
    hist, bins = np.histogram(labels_dbscan1, bins=range(-1, len(set(labels_dbscan1)) + 1))
    
    print ('labels', dict(zip(bins, hist)))
    print ('silhouette', metrics.silhouette_score(X_norm, labels_dbscan1))
    print ('mean cluster dimension', statistics.mean(hist))
    print ('median cluster dimension', statistics.median(hist))
    print ('% outliers', dict(zip(bins, hist))[-1]*100/sum(hist))
    
    dbscan 1, eps 0.2
    labels {-1: 62, 0: 132, 1: 306, 2: 45, 3: 14, 4: 17, 5: 9, 6: 6, 7: 11, 8: 16, 9: 6, 10: 7, 11: 6, 12: 0}
    silhouette 0.7014452720684127
    mean cluster dimension 45
    median cluster dimension 12.5
    % outliers 9.733124018838305
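
Besides labels_, a fitted DBSCAN exposes which points are core points via its core_sample_indices_ attribute; a self-contained sketch on synthetic data with hypothetical parameters:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(n_samples=200, centers=2, cluster_std=0.3, random_state=0)
db = DBSCAN(eps=0.3, min_samples=5).fit(X_demo)

core_mask = np.zeros(len(X_demo), dtype=bool)
core_mask[db.core_sample_indices_] = True
# non-noise, non-core points are border points
border_mask = (db.labels_ != -1) & ~core_mask
print(core_mask.sum(), border_mask.sum(), (db.labels_ == -1).sum())
```

Core, border, and noise points partition the dataset, which is useful when deciding whether removing the -1 labels is losing borderline observations.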
    
    In [ ]:
    from scipy.spatial.distance import cdist
    
    # DBSCAN has no centers_ attribute: compute each cluster's centroid
    # as the mean of its member points (noise, label -1, is skipped)
    data = X_norm
    clusters = labels_dbscan1
    centroids = np.array([data[clusters == i].mean(axis=0)
                          for i in range(clusters.max() + 1)])
    # points array will be used to reach the index easily
    points = np.empty((0, data.shape[1]), float)
    # distances will be used to look for outliers
    distances = np.empty(0, float)
    for i, center_elem in enumerate(centroids):
        # cdist computes the distances between the centroid and the cluster's points
        distances = np.append(distances, cdist([center_elem], data[clusters == i], 'euclidean'))
        points = np.append(points, data[clusters == i], axis=0)
    
    In [305]:
    # plot dbscan 1
    
    colors = ['royalblue', 'maroon', 'forestgreen', 'mediumorchid', 'tan', 'deeppink', 'olive', 'goldenrod', 'lightcyan', 'navy', 'yellow', 'purple', 'black', 'grey']
    vectorizer = np.vectorize(lambda x: colors[x % len(colors)])
    
    plt.scatter(X_norm[:,0], X_norm[:,1], c=vectorizer(labels_dbscan1))
    plt.title('Clustering\nDBSCAN 1')
    plt.show()
    
    In [386]:
    # experiment 2: density-based clustering
    print ('dbscan 2, eps 0.3')
    
    dbscan2 = DBSCAN(eps=0.3, min_samples=5)
    dbscan2.fit(X_norm)
    labels_dbscan2 = dbscan2.labels_
    
    hist, bins = np.histogram(labels_dbscan2, bins=range(-1, len(set(labels_dbscan2)) + 1))
    
    print ('labels', dict(zip(bins, hist)))
    print ('silhouette', metrics.silhouette_score(X_norm, labels_dbscan2))
    print ('mean cluster dimension', statistics.mean(hist))
    print ('median cluster dimension', statistics.median(hist))
    print ('% outliers', dict(zip(bins, hist))[-1]*100/sum(hist))
    
    dbscan 2, eps 0.3
    labels {-1: 45, 0: 132, 1: 310, 2: 103, 3: 14, 4: 6, 5: 11, 6: 16, 7: 0}
    silhouette 0.73452704903901
    mean cluster dimension 70
    median cluster dimension 16
    % outliers 7.06436420722135
    
    In [307]:
    # plot dbscan 2
    
    colors = ['royalblue', 'maroon', 'forestgreen', 'mediumorchid', 'tan', 'deeppink', 'olive', 'goldenrod']
    vectorizer = np.vectorize(lambda x: colors[x % len(colors)])
    
    plt.scatter(X_norm[:,0], X_norm[:,1], c=vectorizer(labels_dbscan2))
    plt.title('Clustering\nDBSCAN 2')
    plt.show()
    
    In [387]:
    # experiment 3: density-based clustering
    print ('dbscan 3, eps=0.2, min_samples=3')
    
    dbscan3 = DBSCAN(eps=0.2, min_samples=3)
    dbscan3.fit(X_norm)
    labels_dbscan3 = dbscan3.labels_
    
    hist, bins = np.histogram(labels_dbscan3, bins=range(-1, len(set(labels_dbscan3)) + 1))
    
    print ('labels', dict(zip(bins, hist)))
    print ('silhouette', metrics.silhouette_score(X_norm, labels_dbscan3))
    print ('mean cluster dimension', statistics.mean(hist))
    print ('median cluster dimension', statistics.median(hist))
    print ('% outliers', dict(zip(bins, hist))[-1]*100/sum(hist))
    
    dbscan 3, eps=0.2, min_samples=3
    labels {-1: 24, 0: 4, 1: 132, 2: 4, 3: 306, 4: 45, 5: 14, 6: 4, 7: 18, 8: 3, 9: 9, 10: 6, 11: 4, 12: 3, 13: 11, 14: 4, 15: 16, 16: 4, 17: 4, 18: 3, 19: 6, 20: 7, 21: 6, 22: 0}
    silhouette 0.755482962735417
    mean cluster dimension 26
    median cluster dimension 6.0
    % outliers 3.767660910518053
    
    In [309]:
    # plot dbscan 3
    
    colors = ['royalblue', 'maroon', 'forestgreen', 'mediumorchid', 'tan', 'deeppink', 'olive', 'goldenrod']
    vectorizer = np.vectorize(lambda x: colors[x % len(colors)])
    
    plt.scatter(X_norm[:,0], X_norm[:,1], c=vectorizer(labels_dbscan3))
    plt.title('Clustering\nDBSCAN 3')
    plt.show()
    

    Save the df with the labels from DBSCAN 1

    In [310]:
    df_dbscan1 = df.copy()
    df_dbscan1.insert(8, "Labels", labels_dbscan1, True)
    #X_dbscan1.sort_values(by=['Labels'])
    df_dbscan1
    
    Out[310]:
    Screen_name UserID TweetID Coords Lat Lon Created_At Text Labels
    0 madikeeper12 868809325 779072240994234368 [43.72666207, 10.41268069] 43.726662 10.412681 Thu Sep 22 21:37:51 +0000 2016 Cieli infuocati.\n\n#picoftheday #quotesofthed... -1
    1 madikeeper12 868809325 781615843406819329 [43.72666207, 10.41268069] 43.726662 10.412681 Thu Sep 29 22:05:13 +0000 2016 Prospettive.. \nunite a casa #ilselfone\n#team... -1
    2 madikeeper12 868809325 781870800156499968 [43.72666207, 10.41268069] 43.726662 10.412681 Fri Sep 30 14:58:19 +0000 2016 Non occorre essere matti per lavorare qui, ma ... -1
    3 madikeeper12 868809325 780003801260404736 [43.7167, 10.3833] 43.716700 10.383300 Sun Sep 25 11:19:32 +0000 2016 RunOnSunDay 🏃🏽‍♀️☀️\n#run #running #runner #ni... 0
    4 madikeeper12 868809325 779443101123260417 [43.70561, 10.42059] 43.705610 10.420590 Fri Sep 23 22:11:31 +0000 2016 La vita è come la fotografia sono necessari i ... -1
    ... ... ... ... ... ... ... ... ... ...
    632 antoniocassisa 358042635 781879291911016448 [43.7167, 10.3833] 43.716700 10.383300 Fri Sep 30 15:32:04 +0000 2016 I mì ómini \n#son #figli #boys @ Pisa, Italy h... 0
    633 SefaMermer 293157588 780753755830677504 [43.7167, 10.3833] 43.716700 10.383300 Tue Sep 27 12:59:35 +0000 2016 #love #tbt #tagforlikes #TFLers #tweegram #pho... 0
    634 SefaMermer 293157588 780756143668953088 [43.7167, 10.3833] 43.716700 10.383300 Tue Sep 27 13:09:05 +0000 2016 #love #tbt #tagforlikes #TFLers #tweegram #pho... 0
    635 matteluca89 494389053 779638196258811904 [43.71544235, 10.40051616] 43.715442 10.400516 Sat Sep 24 11:06:45 +0000 2016 Last saturday I went out with my #chinese teac... 2
    636 anabrmotta 98254561 781123690343698432 [43.72263, 10.3948] 43.722630 10.394800 Wed Sep 28 13:29:35 +0000 2016 Já que é pra tombar, ela tombou (só um pouquin... 1

    637 rows × 9 columns

    In [311]:
    #saving the dataframe
    df_dbscan1.to_csv('../df_dbscan.csv')
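
A side note: to_csv writes the row index by default, which is exactly what produced the Unnamed: 0 column dropped at the top of the notebook. A round-trip sketch on hypothetical in-memory data showing two ways to avoid it:

```python
import io
import pandas as pd

df_demo = pd.DataFrame({'Lat': [43.7167], 'Lon': [10.3833]})

# option 1: don't write the index at all
buf = io.StringIO()
df_demo.to_csv(buf, index=False)
buf.seek(0)
print(pd.read_csv(buf).columns.tolist())                 # → ['Lat', 'Lon']

# option 2: write it, but read it back as the index
buf2 = io.StringIO()
df_demo.to_csv(buf2)
buf2.seek(0)
print(pd.read_csv(buf2, index_col=0).columns.tolist())   # → ['Lat', 'Lon']
```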
    

    Save the df with the labels from DBSCAN 3

    In [312]:
    df_dbscan3 = df.copy()
    df_dbscan3.insert(8, "Labels", labels_dbscan3, True)
    #X_dbscan1.sort_values(by=['Labels'])
    df_dbscan3
    
    Out[312]:
    Screen_name UserID TweetID Coords Lat Lon Created_At Text Labels
    0 madikeeper12 868809325 779072240994234368 [43.72666207, 10.41268069] 43.726662 10.412681 Thu Sep 22 21:37:51 +0000 2016 Cieli infuocati.\n\n#picoftheday #quotesofthed... 0
    1 madikeeper12 868809325 781615843406819329 [43.72666207, 10.41268069] 43.726662 10.412681 Thu Sep 29 22:05:13 +0000 2016 Prospettive.. \nunite a casa #ilselfone\n#team... 0
    2 madikeeper12 868809325 781870800156499968 [43.72666207, 10.41268069] 43.726662 10.412681 Fri Sep 30 14:58:19 +0000 2016 Non occorre essere matti per lavorare qui, ma ... 0
    3 madikeeper12 868809325 780003801260404736 [43.7167, 10.3833] 43.716700 10.383300 Sun Sep 25 11:19:32 +0000 2016 RunOnSunDay 🏃🏽‍♀️☀️\n#run #running #runner #ni... 1
    4 madikeeper12 868809325 779443101123260417 [43.70561, 10.42059] 43.705610 10.420590 Fri Sep 23 22:11:31 +0000 2016 La vita è come la fotografia sono necessari i ... 2
    ... ... ... ... ... ... ... ... ... ...
    632 antoniocassisa 358042635 781879291911016448 [43.7167, 10.3833] 43.716700 10.383300 Fri Sep 30 15:32:04 +0000 2016 I mì ómini \n#son #figli #boys @ Pisa, Italy h... 1
    633 SefaMermer 293157588 780753755830677504 [43.7167, 10.3833] 43.716700 10.383300 Tue Sep 27 12:59:35 +0000 2016 #love #tbt #tagforlikes #TFLers #tweegram #pho... 1
    634 SefaMermer 293157588 780756143668953088 [43.7167, 10.3833] 43.716700 10.383300 Tue Sep 27 13:09:05 +0000 2016 #love #tbt #tagforlikes #TFLers #tweegram #pho... 1
    635 matteluca89 494389053 779638196258811904 [43.71544235, 10.40051616] 43.715442 10.400516 Sat Sep 24 11:06:45 +0000 2016 Last saturday I went out with my #chinese teac... 4
    636 anabrmotta 98254561 781123690343698432 [43.72263, 10.3948] 43.722630 10.394800 Wed Sep 28 13:29:35 +0000 2016 Já que é pra tombar, ela tombou (só um pouquin... 3

    637 rows × 9 columns

    In [313]:
    #saving the dataframe
    df_dbscan3.to_csv('../df_dbscan3.csv')
    

    Working on the first DBSCAN: I remove the points labelled -1, which are noise points and therefore not of interest.

    In [314]:
    X_dbscan1 = X.copy() # create a new dataframe to hold the labels obtained from clustering with dbscan
    X_dbscan1
    
    Out[314]:
    Lat Lon
    0 43.726662 10.412681
    1 43.726662 10.412681
    2 43.726662 10.412681
    3 43.716700 10.383300
    4 43.705610 10.420590
    ... ... ...
    632 43.716700 10.383300
    633 43.716700 10.383300
    634 43.716700 10.383300
    635 43.715442 10.400516
    636 43.722630 10.394800

    637 rows × 2 columns

    In [315]:
    X_dbscan1.insert(2, "Labels", labels_dbscan1, True)
    
    In [316]:
    X_dbscan1_cleaned = X_dbscan1.drop(X_dbscan1[X_dbscan1['Labels'] == -1].index)
    X_dbscan1_cleaned.sort_values(by=['Labels'])
    
    Out[316]:
    Lat Lon Labels
    3 43.71670 10.38330 0
    167 43.71670 10.38330 0
    483 43.71670 10.38330 0
    480 43.71670 10.38330 0
    171 43.71670 10.38330 0
    ... ... ... ...
    626 43.71266 10.39692 11
    627 43.71266 10.39692 11
    628 43.71266 10.39692 11
    629 43.71266 10.39692 11
    624 43.71266 10.39692 11

    575 rows × 3 columns

    In [317]:
    X_dbscan1_cleaned_norm = StandardScaler().fit_transform(X_dbscan1_cleaned)
    X_dbscan1_cleaned_norm
    
    Out[317]:
    array([[-0.5283139 , -1.6009732 , -0.73282031],
           [ 0.59706207,  0.44330365, -0.30372492],
           [ 0.59706207,  0.44330365, -0.30372492],
           ...,
           [-0.5283139 , -1.6009732 , -0.73282031],
           [-0.75100539,  1.08222286,  0.12537048],
           [ 0.52170836,  0.19134023, -0.30372492]])
    In [365]:
    # density based clustering
    print ('dbscan 1 cleaned (without noise)')
    
    dbscan1.fit(X_dbscan1_cleaned_norm)
    labels_dbscan1_cleaned = dbscan1.labels_
    
    hist, bins = np.histogram(labels_dbscan1_cleaned, bins=range(-1, len(set(labels_dbscan1_cleaned)) + 1))
    
    print ('labels', dict(zip(bins, hist)))
    print ('silhouette', metrics.silhouette_score(X_dbscan1_cleaned_norm, labels_dbscan1_cleaned))
    print ('mean cluster dimension', statistics.mean(hist))
    print ('median cluster dimension', statistics.median(hist))
    
    dbscan 1 cleaned (without noise)
    labels {-1: 6, 0: 132, 1: 306, 2: 35, 3: 14, 4: 16, 5: 8, 6: 6, 7: 11, 8: 16, 9: 6, 10: 6, 11: 7, 12: 6, 13: 0}
    silhouette 0.8590055439680456
    mean cluster dimension 38
    median cluster dimension 8
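A silhouette of about 0.86 on the re-clustered, noise-free data is plausible: once noise points are removed, the remaining groups are compact and well separated. A minimal sketch of the same effect on toy data (assumed values, not the tweet coordinates):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn import metrics

# two tight, well-separated toy groups (stand-ins for the cleaned coordinates)
pts = np.array([[0.0, 0.0], [0.0, 0.1], [0.1, 0.0],
                [5.0, 5.0], [5.0, 5.1], [5.1, 5.0]])
lab = DBSCAN(eps=0.5, min_samples=2).fit(pts).labels_

print(lab)                                   # two clusters, no -1 (noise) labels
print(metrics.silhouette_score(pts, lab))    # close to 1 for separated clusters
```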
    
    In [319]:
    # plot dbscan cleaned 1
    
    colors = ['royalblue', 'maroon', 'forestgreen', 'mediumorchid', 'tan', 'deeppink', 'olive', 'goldenrod', 'lightcyan', 'navy', 'yellow', 'purple', 'black', 'grey', 'red']
    vectorizer = np.vectorize(lambda x: colors[x % len(colors)])
    
    plt.scatter(X_dbscan1_cleaned_norm[:,0], X_dbscan1_cleaned_norm[:,1], c=vectorizer(labels_dbscan1_cleaned))
    plt.title('Clustering\nDBSCAN 1 Cleaned')
    plt.show()
    
    In [320]:
    X_dbscan1_cleaned["Coords"] = "[" + X_dbscan1_cleaned["Lat"].astype(str) + " , " + X_dbscan1_cleaned["Lon"].astype(str) + "]"
    X_dbscan1_cleaned
    
    Out[320]:
    Lat Lon Labels Coords
    3 43.716700 10.383300 0 [43.7167 , 10.3833]
    5 43.723056 10.396417 1 [43.72305556 , 10.39641667]
    6 43.723056 10.396417 1 [43.72305556 , 10.39641667]
    7 43.720750 10.396940 1 [43.72075 , 10.396939999999999]
    8 43.723056 10.396417 1 [43.72305556 , 10.39641667]
    ... ... ... ... ...
    632 43.716700 10.383300 0 [43.7167 , 10.3833]
    633 43.716700 10.383300 0 [43.7167 , 10.3833]
    634 43.716700 10.383300 0 [43.7167 , 10.3833]
    635 43.715442 10.400516 2 [43.71544235 , 10.40051616]
    636 43.722630 10.394800 1 [43.72263 , 10.3948]

    575 rows × 4 columns

    Visualizing the clusters on a map with folium

    In [321]:
    import folium
    from folium import plugins
    
    import matplotlib.pyplot as plt
    import matplotlib.cm as cm
    import matplotlib.colors as colors
    
    import seaborn as sns
    
    In [322]:
    map_clusters3 = folium.Map(location=[lat, lng], zoom_start=13.2)
    
    # set color scheme for the clusters
    x = np.arange(15)
    ys = [i + x + (i*x)**2 for i in range(15)]
    colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
    rainbow = [colors.rgb2hex(i) for i in colors_array]
    
    # add markers to the map
    markers_colors = []
    for lat, lng, cluster in zip(X_dbscan1['Lat'], X_dbscan1['Lon'],  
                                                X_dbscan1['Labels']):
        #label = folium.Popup(str(city)+ ','+str(state) + '- Cluster ' + str(cluster), parse_html=True)
        folium.vector_layers.CircleMarker(
            [lat, lng],
            radius=5,
            #popup=label,
            tooltip = 'Cluster ' + str(cluster),
            color=rainbow[cluster-1],
            fill=True,
            fill_color=rainbow[cluster-1],
            fill_opacity=0.9).add_to(map_clusters3)
    
    print('Map with DBSCAN 1 clustering (with noise)')
    map_clusters3
    
    Map with DBSCAN 1 clustering (with noise)
    
    Out[322]:
    Make this Notebook Trusted to load map: File -> Trust Notebook
    In [323]:
    map_clusters4 = folium.Map(location=[lat, lng], zoom_start=13.2)
    
    # set color scheme for the clusters
    x = np.arange(15)
    ys = [i + x + (i*x)**2 for i in range(15)]
    colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
    rainbow = [colors.rgb2hex(i) for i in colors_array]
    
    # add markers to the map
    markers_colors = []
    for lat, lng, cluster in zip(X_dbscan1_cleaned['Lat'], X_dbscan1_cleaned['Lon'],  
                                                X_dbscan1_cleaned['Labels']):
        #label = folium.Popup(str(city)+ ','+str(state) + '- Cluster ' + str(cluster), parse_html=True)
        folium.vector_layers.CircleMarker(
            [lat, lng],
            radius=5,
            #popup=label,
            tooltip = 'Cluster ' + str(cluster),
            color=rainbow[cluster-1],
            fill=True,
            fill_color=rainbow[cluster-1],
            fill_opacity=0.9).add_to(map_clusters4)
    
    print('Map with DBSCAN 1 clustering (cleaned)')
    map_clusters4
    
    Map with DBSCAN 1 clustering (cleaned)
    
    Out[323]:
    Make this Notebook Trusted to load map: File -> Trust Notebook

    GEOHASH

    In [324]:
    import pygeohash as pgh
    import geohash as gh
    import geopandas as gpd
    from polygon_geohasher.polygon_geohasher import geohash_to_polygon 
    
    In [325]:
    X_geohash = X.copy() # new dataframe to hold the geohash codes
    X_geohash
    
    Out[325]:
    Lat Lon
    0 43.726662 10.412681
    1 43.726662 10.412681
    2 43.726662 10.412681
    3 43.716700 10.383300
    4 43.705610 10.420590
    ... ... ...
    632 43.716700 10.383300
    633 43.716700 10.383300
    634 43.716700 10.383300
    635 43.715442 10.400516
    636 43.722630 10.394800

    637 rows × 2 columns

    In [326]:
    X_geohash['geohash'] = X_geohash.apply(lambda x: gh.encode(x.Lat, x.Lon, precision=6), axis=1)
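Under the hood, `gh.encode` follows the standard geohash algorithm: it interleaves longitude/latitude bisection bits and maps each 5-bit group to a base-32 character. A self-contained sketch (my own reimplementation for illustration, not the library's code), which reproduces the code seen further below for row 3 of the dataframe:

```python
# Standard geohash encoding sketch: interleave lon/lat bisection bits,
# then map each 5-bit group to the geohash base-32 alphabet.
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat, lon, precision=6):
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    bits = []
    even = True  # even bit positions refine longitude, odd ones latitude
    while len(bits) < precision * 5:
        if even:
            mid = (lon_lo + lon_hi) / 2
            bits.append(1 if lon >= mid else 0)
            lon_lo, lon_hi = (mid, lon_hi) if lon >= mid else (lon_lo, mid)
        else:
            mid = (lat_lo + lat_hi) / 2
            bits.append(1 if lat >= mid else 0)
            lat_lo, lat_hi = (mid, lat_hi) if lat >= mid else (lat_lo, mid)
        even = not even
    return "".join(BASE32[int("".join(map(str, bits[i:i + 5])), 2)]
                   for i in range(0, precision * 5, 5))

print(geohash_encode(43.7167, 10.3833))  # → spz2sq (row 3 of the dataframe)
```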
    

    Counting the geohash codes, in order to build a map and a color scale later:

    In [327]:
    codes_geohash = X_geohash['geohash'].tolist()
    codes_geohash.sort()
    codes_geohash
    
    Out[327]:
    ['spz2s7',
     'spz2s7',
     'spz2s7',
     'spz2s7',
     'spz2sd',
     'spz2sd',
     ...
     'spz2ub',
     'spz2ub',
     'spz2v0']
    In [328]:
    values = []
    
    for code_geohash in codes_geohash:
        values.append(codes_geohash.count(code_geohash))
        
    values_array = np.array(values)
    
    values = set(values_array)
    values
    
    Out[328]:
    {1, 2, 4, 5, 6, 7, 12, 14, 27, 32, 53, 136, 318}
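Each iteration of the loop above calls `list.count`, which rescans the whole list, so the counting is quadratic; `collections.Counter` produces the same per-element counts in a single pass. A sketch on a toy list standing in for `codes_geohash`:

```python
from collections import Counter

# toy stand-in for codes_geohash
codes = ['spz2sq', 'spz2sq', 'spz2sx', 'spz2ub', 'spz2sq']
counts = Counter(codes)                  # one pass over the list
values = [counts[c] for c in codes]      # same per-element counts as the loop
print(values)  # → [3, 3, 1, 1, 3]
```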
    In [329]:
    # for now the normalized array is not used!
    norm = np.linalg.norm(values_array)
    normal_array = values_array/norm
    print(normal_array)
    
    [0.00067729 0.00067729 0.00067729 0.00067729 0.00101593 0.00101593
     ...
     0.00118525 0.00118525 0.00118525 0.00118525 0.00118525 0.00118525
     0.00016932]
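Dividing by `np.linalg.norm` gives a unit-L2-norm vector, which preserves proportions but does not map the counts onto a fixed range; if a 0-1 scale is ever wanted for a color map, min-max scaling is a common alternative. A sketch on toy counts:

```python
import numpy as np

v = np.array([1.0, 4.0, 6.0, 318.0])          # toy stand-in for values_array
l2 = v / np.linalg.norm(v)                    # what the cell above computes
minmax = (v - v.min()) / (v.max() - v.min())  # smallest count -> 0, largest -> 1
print(minmax[0], minmax[-1])  # → 0.0 1.0
```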
    

    Quick checks:

    In [330]:
    decoded_location = gh.decode(X_geohash['geohash'][0])
    decoded_location2 = gh.decode(X_geohash['geohash'][1])
    
    In [331]:
    decoded_location
    
    Out[331]:
    (43.72833251953125, 10.4095458984375)
    In [332]:
    decoded_location2
    
    Out[332]:
    (43.72833251953125, 10.4095458984375)
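Both points decode to the same coordinates because `gh.decode` returns the centre of the geohash cell, and at precision 6 these two tweets fall in the same cell. A precision-6 hash encodes 30 bits, alternating longitude and latitude, so each cell spans `360 / 2**15` degrees of longitude by `180 / 2**15` degrees of latitude (roughly 0.9 km x 0.6 km at Pisa's latitude), matching the polygon coordinates shown further below. A quick check of those spans:

```python
# precision 6 -> 30 bits total; even bit positions refine longitude (15 bits),
# odd positions refine latitude (15 bits)
lon_cell = 360.0 / 2**15   # degrees of longitude per cell
lat_cell = 180.0 / 2**15   # degrees of latitude per cell
print(lon_cell, lat_cell)  # → 0.010986328125 0.0054931640625
```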
    In [333]:
    gh.neighbors(X_geohash['geohash'][0])
    
    Out[333]:
    ['spz2u8',
     'spz2v0',
     'spz2sz',
     'spz2sx',
     'spz2tp',
     'spz2uc',
     'spz2u9',
     'spz2v1']
    In [334]:
    gh.neighbors(X_geohash['geohash'][200])
    
    Out[334]:
    ['spz2sr',
     'spz2sz',
     'spz2sw',
     'spz2sq',
     'spz2sy',
     'spz2u8',
     'spz2u2',
     'spz2ub']

    Creating a GeoDataFrame to build the geohash grid on the folium map:

    In [335]:
    import json
    # Create Geo Pandas DataFrame
    df_geo = gpd.GeoDataFrame({'location':df_coords.tolist(), 'value': values_array})
    df_geo['geohash'] = X_geohash['geohash']
    df_geo['geometry'] = df_geo['geohash'].apply(geohash_to_polygon)
    df_geo.crs = 'EPSG:4326'
    
    
    print('features.properties.geohash')
    display(json.loads(df_geo.to_json())['features'][0])
    display(df_geo.head())
    
    features.properties.geohash
    
    
    {'id': '0',
     'type': 'Feature',
     'properties': {'geohash': 'spz2ub',
      'location': '[43.72666207, 10.41268069]',
      'value': 4},
     'geometry': {'type': 'Polygon',
      'coordinates': [[[10.404052734375, 43.7255859375],
        [10.4150390625, 43.7255859375],
        [10.4150390625, 43.7310791015625],
        [10.404052734375, 43.7310791015625],
        [10.404052734375, 43.7255859375]]]}}
    location value geohash geometry
    0 [43.72666207, 10.41268069] 4 spz2ub POLYGON ((10.40405 43.72559, 10.41504 43.72559...
    1 [43.72666207, 10.41268069] 4 spz2ub POLYGON ((10.40405 43.72559, 10.41504 43.72559...
    2 [43.72666207, 10.41268069] 4 spz2ub POLYGON ((10.40405 43.72559, 10.41504 43.72559...
    3 [43.7167, 10.3833] 4 spz2sq POLYGON ((10.38208 43.71460, 10.39307 43.71460...
    4 [43.70561, 10.42059] 6 spz2th POLYGON ((10.41504 43.70361, 10.42603 43.70361...
    In [336]:
    import folium
    
    lat = [43.7359, 43.6955]
    lon = [10.4269, 10.3686]
    
    lat_mean = np.mean(lat)
    lon_mean = np.mean(lon)
    
    lat, lng = (lat_mean, lon_mean)
    
    m = folium.Map((lat, lng), zoom_start=13.2)
    folium.Choropleth(geo_data=df_geo, 
                      name='choropleth',
                      data=df_geo,
                      columns=['geohash', 'value'],
                      key_on='feature.properties.geohash',
                      fill_color='YlGn',
                      fill_opacity=0.7,
                      line_opacity=0.2,
                      legend_name='Number of tweets per geohash cell').add_to(m)
    m
    
    Out[336]:
    Make this Notebook Trusted to load map: File -> Trust Notebook
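One detail worth noting: `df_geo` has one row per tweet, so the same geohash polygon is passed to `Choropleth` many times. Dropping duplicate cells first would pass each polygon once; a sketch with plain pandas on a toy frame standing in for `df_geo`:

```python
import pandas as pd

# toy stand-in for df_geo: one row per tweet, so cells repeat
toy = pd.DataFrame({'geohash': ['spz2sq', 'spz2sq', 'spz2ub'],
                    'value':   [2, 2, 1]})
one_per_cell = toy.drop_duplicates(subset='geohash')  # one row per polygon
print(len(one_per_cell))  # → 2
```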
    In [337]:
    df_geohash = df.copy()
    df_geohash.insert(8, "geohash", X_geohash['geohash'] , True)
    df_geohash
    
    Out[337]:
    Screen_name UserID TweetID Coords Lat Lon Created_At Text geohash
    0 madikeeper12 868809325 779072240994234368 [43.72666207, 10.41268069] 43.726662 10.412681 Thu Sep 22 21:37:51 +0000 2016 Cieli infuocati.\n\n#picoftheday #quotesofthed... spz2ub
    1 madikeeper12 868809325 781615843406819329 [43.72666207, 10.41268069] 43.726662 10.412681 Thu Sep 29 22:05:13 +0000 2016 Prospettive.. \nunite a casa #ilselfone\n#team... spz2ub
    2 madikeeper12 868809325 781870800156499968 [43.72666207, 10.41268069] 43.726662 10.412681 Fri Sep 30 14:58:19 +0000 2016 Non occorre essere matti per lavorare qui, ma ... spz2ub
    3 madikeeper12 868809325 780003801260404736 [43.7167, 10.3833] 43.716700 10.383300 Sun Sep 25 11:19:32 +0000 2016 RunOnSunDay 🏃🏽‍♀️☀️\n#run #running #runner #ni... spz2sq
    4 madikeeper12 868809325 779443101123260417 [43.70561, 10.42059] 43.705610 10.420590 Fri Sep 23 22:11:31 +0000 2016 La vita è come la fotografia sono necessari i ... spz2th
    ... ... ... ... ... ... ... ... ... ...
    632 antoniocassisa 358042635 781879291911016448 [43.7167, 10.3833] 43.716700 10.383300 Fri Sep 30 15:32:04 +0000 2016 I mì ómini \n#son #figli #boys @ Pisa, Italy h... spz2sq
    633 SefaMermer 293157588 780753755830677504 [43.7167, 10.3833] 43.716700 10.383300 Tue Sep 27 12:59:35 +0000 2016 #love #tbt #tagforlikes #TFLers #tweegram #pho... spz2sq
    634 SefaMermer 293157588 780756143668953088 [43.7167, 10.3833] 43.716700 10.383300 Tue Sep 27 13:09:05 +0000 2016 #love #tbt #tagforlikes #TFLers #tweegram #pho... spz2sq
    635 matteluca89 494389053 779638196258811904 [43.71544235, 10.40051616] 43.715442 10.400516 Sat Sep 24 11:06:45 +0000 2016 Last saturday I went out with my #chinese teac... spz2sw
    636 anabrmotta 98254561 781123690343698432 [43.72263, 10.3948] 43.722630 10.394800 Wed Sep 28 13:29:35 +0000 2016 Já que é pra tombar, ela tombou (só um pouquin... spz2sx

    637 rows × 9 columns

    In [338]:
    #saving the dataframe
    df_geohash.to_csv('../df_geohash.csv')
    
    In [339]:
    # group the df by geohash code, so the codes can be replaced with numbers and made easier to read
    Geohash_codes = df_geohash.groupby(['geohash'])
    
    In [340]:
    i = 0
    
    for key, items in Geohash_codes:
        df_geohash['geohash'] = df_geohash['geohash'].replace(key, i)
        i = i+1
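The replace loop maps the sorted distinct geohash codes to 0...n-1 (groupby sorts its keys). The same numbering can be obtained in one vectorised step with pandas categorical codes; a sketch on a toy series standing in for the geohash column:

```python
import pandas as pd

codes = pd.Series(['spz2ub', 'spz2sq', 'spz2ub', 'spz2th'])
# categories are the sorted unique codes, so cat.codes reproduces the loop's numbering
numeric = codes.astype('category').cat.codes
print(numeric.tolist())  # → [2, 0, 2, 1]
```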
    
    In [341]:
    df_geohash
    
    Out[341]:
    Screen_name UserID TweetID Coords Lat Lon Created_At Text geohash
    0 madikeeper12 868809325 779072240994234368 [43.72666207, 10.41268069] 43.726662 10.412681 Thu Sep 22 21:37:51 +0000 2016 Cieli infuocati.\n\n#picoftheday #quotesofthed... 18
    1 madikeeper12 868809325 781615843406819329 [43.72666207, 10.41268069] 43.726662 10.412681 Thu Sep 29 22:05:13 +0000 2016 Prospettive.. \nunite a casa #ilselfone\n#team... 18
    2 madikeeper12 868809325 781870800156499968 [43.72666207, 10.41268069] 43.726662 10.412681 Fri Sep 30 14:58:19 +0000 2016 Non occorre essere matti per lavorare qui, ma ... 18
    3 madikeeper12 868809325 780003801260404736 [43.7167, 10.3833] 43.716700 10.383300 Sun Sep 25 11:19:32 +0000 2016 RunOnSunDay 🏃🏽‍♀️☀️\n#run #running #runner #ni... 5
    4 madikeeper12 868809325 779443101123260417 [43.70561, 10.42059] 43.705610 10.420590 Fri Sep 23 22:11:31 +0000 2016 La vita è come la fotografia sono necessari i ... 14
    ... ... ... ... ... ... ... ... ... ...
    632 antoniocassisa 358042635 781879291911016448 [43.7167, 10.3833] 43.716700 10.383300 Fri Sep 30 15:32:04 +0000 2016 I mì ómini \n#son #figli #boys @ Pisa, Italy h... 5
    633 SefaMermer 293157588 780753755830677504 [43.7167, 10.3833] 43.716700 10.383300 Tue Sep 27 12:59:35 +0000 2016 #love #tbt #tagforlikes #TFLers #tweegram #pho... 5
    634 SefaMermer 293157588 780756143668953088 [43.7167, 10.3833] 43.716700 10.383300 Tue Sep 27 13:09:05 +0000 2016 #love #tbt #tagforlikes #TFLers #tweegram #pho... 5
    635 matteluca89 494389053 779638196258811904 [43.71544235, 10.40051616] 43.715442 10.400516 Sat Sep 24 11:06:45 +0000 2016 Last saturday I went out with my #chinese teac... 10
    636 anabrmotta 98254561 781123690343698432 [43.72263, 10.3948] 43.722630 10.394800 Wed Sep 28 13:29:35 +0000 2016 Já que é pra tombar, ela tombou (só um pouquin... 11

    637 rows × 9 columns

    In [342]:
    #saving the dataframe
    df_geohash.to_csv('../df_geohash_to_numbers.csv')
    
    In [ ]: